Abstract—Peer-to-peer networks have been quite thoroughly
measured over the past years; however, it is interesting to note that
the BitTorrent Mainline DHT has received very little attention 
even though it is by far the largest of currently active overlay 
systems, as our results show. As Mainline DHT differs from other 
systems, existing measurement methodologies are not appropriate 
for studying it. In this paper we present an efficient methodology 
for estimating the number of active users in the network. We have 
identified an omission in previous methodologies used to measure 
the size of the network and our methodology corrects this. Our 
method is based on modeling crawling inaccuracies as a Bernoulli 
process. It guarantees a very accurate estimation and is able to 
provide the estimate in about 5 seconds. Through experiments 
in controlled situations, we demonstrate the accuracy of our 
method and show the causes of the inaccuracies in previous work, 
by reproducing the incorrect results. We also
report on the results from our measurements which have been
going on for almost 2.5 years and are the first long-term study 
of Mainline DHT. 


I. INTRODUCTION 


There have been many measurements done on peer-to- 
peer networks in general and BitTorrent in particular (see 
Section VI for a detailed comparison) over the past years. 
However, most of the recent studies, e.g., [1]-[3], have focused 
on smaller networks like KAD and Vuze and we are aware of 
only two other studies on Mainline DHT [4], [5], but as we 
discuss in Section VI, their estimation methods omit a crucial 
parameter and thus yield highly inaccurate results. 

Because of this, we developed a more accurate method for 
measuring the number of nodes in Mainline DHT. Although 
we use Mainline DHT as our test case, our methodology
equally applies to any measurement of a large-scale system
based on sampling a part of the system and scaling up to obtain
the number of nodes in the system. Existing studies incorrectly
assumed that combining the results of a small number of 
samples will yield a good estimate of the network size. Our 
work shows that although the estimate from combined samples 
is better than an estimate from a single sample, errors on 
the order of tens of percent in the size of the network still
persist. Our methodology fixes this error and yields much more 
accurate estimates. 

In order to demonstrate the higher accuracy of our method, 
we perform extensive validation experiments in a controlled 
environment and show that previously presented methods yield 
incorrect results. Through an iterative tweaking of previous 
methods, we show what the causes of the errors are and that
by fixing those issues, correct results can be obtained. 
Additionally, we obtain a comprehensive picture of users
in Mainline DHT.

Although our focus is not on presenting the actual
measurement data from Mainline DHT, we will briefly present
the main findings regarding the number of nodes and the churn 
patterns, since previous reports on these have been inaccurate. 

The contributions of this paper are as follows: 

1) We identify a systematic error in previous works measuring
   the number of nodes in DHT-based BitTorrent networks (e.g.,
   Mainline DHT, KAD, Vuze) and present the cause behind it.

2) We develop an efficient and accurate methodology called
   Correction Factor for measuring the size of Mainline
   DHT. Our methodology gives an accurate estimate of
   the system size in less than 5 seconds. The methodology
   is based on modeling the inaccuracies of the crawling
   as a Bernoulli process.

3) We validate our methodology and justify our claims
   about the inaccuracies of previous works by performing
   extensive comparison and validation in a controlled
   environment.

4) Applying the methodology to Mainline DHT over a
   period of more than 2 years, we discover that the number [...]

Many of our findings are very different from most of the 
results previously presented about BitTorrent-like systems. As 
we later show, these differences are mainly due to two factors. 
First, previous studies either used different methodologies 
or studied different systems; these will naturally produce 
different results. Second, some of the previous studies have 
used inaccurate methods leading to incorrect results. Our 
methodology fixes the inaccuracies and therefore gives more 
accurate results. 

We would like to point out that our methodology aims at 
estimating the size of the network, although as we show in 
Section V it has other applications as well. The inaccuracies 
in previous work relate only to their estimates of the size 
of the network; other parameters such as session lengths, 
shared content, etc., are unaffected by the omission in their 
measurement methodology. 

This paper is structured as follows. Section II discusses the
merits of different measurement methodologies. In Section III,
we show our crawler design and give details on our mea- 
surement methodology. In Section IV, we show how we set 
up our monitor system, and discuss how crawler performance 
affects results. In Section V we show how our measurement 
methodology can also be used to discover Sybil-attacks in the 
system. We discuss related work in Section VI, and conclude 
our paper in Section VII. 


II. SYSTEMS AND MEASUREMENTS 


Measuring peer-to-peer (P2P) networks and in particular 
BitTorrent has been very popular in the networking community 
over the last decade. Measurement methods can be divided into 
different categories either based on the system or methodology 
used. In this section, we first present an overview of Mainline 
DHT and compare it with other DHT-based systems. We then 
give an overview of measurement methodologies and discuss 
their pros and cons. 


A. Mainline DHT 


Our focus in measuring Mainline DHT (MLDHT) is on
obtaining a system level view of the network. In MLDHT, node
IDs are 160 bits long and are not persistent, i.e., every time a
node joins the system it will generate an ID on the fly.



MLDHT is the largest P2P system today with 15-27 million
users. Because of its popularity in the real world, many modern
P2P applications support the MLDHT protocol and it has in fact
evolved into an ecosystem [6]. Obtaining a system level view is
therefore paramount to a better understanding of this ecosystem.
There is another DHT implementation in the BitTorrent world,
namely Vuze DHT [7]. Even though both are based on Kademlia
[8], they are incompatible as protocols. Vuze has been estimated
to have about 1 million users [9], [10], but as we show, MLDHT
has 10 to 20 times as many users, making it a more important
network.


B. Methodologies 


We classify existing BitTorrent measurement methodologies 
in two high-level categories: tracker and DHT-based. These 
can be further refined into sub-categories as described below. 
Table I shows an overview of the sub-categories and respective 
advantages and disadvantages. In this section, we focus on 
the differences in methodologies and return to contrasting our 
results with related work more closely in Section VI. 


Tracker-based measurements can be divided into three
sub-categories:

• Instrumenting a client
• Monitoring a swarm
• Using tracker logs

Research with instrumented clients, e.g., [6], [11]-[13], has
direct access to user behavior. The drawback of using
instrumented clients is the risk of obtaining a biased
measurement, since data comes only from users who have
specifically installed the instrumented client. Existing studies
typically do not address this issue of possible bias.


Swarm-based measurement in general focuses on a single
swarm. Monitoring can happen with instrumented clients (which
need to be a part of the swarm), in which case the measurement
is limited to what the client sees. Swarm-based measurement is
appropriate when investigating a particular swarm (or similar
swarms), but its results are only applicable to that swarm. For
example, client behavior in a swarm for a popular movie
is going to be very different from clients in an unpopular
swarm for an electronic book.


In both swarm-based measurement and instrumented client
measurements, there is also the risk that the measurement is
ss. This is because GD 


s [14], [15] and 


Besides swarm-based research, there is also some work
based on analyzing tracker logs [6], [16]-[18]. Tracker-based
measurement gives a broader view than swarm-based
measurement, since a tracker typically hosts many swarms. However,
even popular trackers have a biased view of the system as a
whole. For example, Chinese users represent a large fraction of
BitTorrent users, but they mainly use trackers inside China and
are therefore not covered by popular international trackers. As
Zhang et al. [18] show, there are a lot of private trackers in use,
which further limits tracker-based analysis.

DHT-based

There are three popular DHT-based systems: KAD, Vuze,
and MLDHT. They are all based on the Kademlia DHT;
however, MLDHT is different from the other two, requiring
a slightly adapted measurement methodology. Measurements
of KAD and Vuze have been very popular recently [1]-[3], [9],
[10], [17], [19]-[21]. The main difference between the systems
is that node IDs in KAD are persistent, whereas MLDHT uses
dynamic IDs.


Steiner et al. [2], [19] state that they were able to crawl 
an 8-bit zone in KAD network within 2.5 seconds, and a full 
crawl in 8 minutes. This is possible because of the persistent 
IDs which means that any IDs collected during previous crawls 
can be reused in subsequent crawls. In MLDHT with dynamic 
IDs, this is not possible and thus MLDHT requires more time 
to crawl a similar zone. 

Method | Advantages | Disadvantages | References
Instrumented client | Direct access to user behavior | Data comes only from instrumented clients, not all clients; biased by client selection | [6], [11]-[13]
Swarm monitoring | Focuses on content | Results only applicable to that swarm; biased by swarm selection |
Tracker | Focuses on content; better coverage than single swarms | Biased by tracker selection | [6], [16]-[18]
Persistent DHT | Fast comprehensive crawls possible; system-level view | Studying session times difficult; content monitoring difficult | [1]-[3], [9], [10], [17], [19]-[21]
Dynamic DHT | Fast crawls possible; system-level view | Content monitoring difficult |

TABLE I: Classification of measurement methodologies for BitTorrent-like systems


In general, the methods for persistent and dynamic DHT IDs
are similar, i.e., crawling a specific zone or injecting sybils into
the system. General rules of thumb dictate that the duration
of a crawl should be as short as possible to obtain a good
snapshot of the system.

However, existing measurements on MLDHT, KAD, and Vuze
all contain a methodological omission. They fail to take into
account the nodes missed in a crawl (see Section III-D), which
means that their results are incorrect. By failing to account for
the nodes missed in a crawl, they underestimate the size of the
network. As our results show, for MLDHT this
error would be several tens of percent.


C. Background on MLDHT 


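We now present a short overview of how MLDHT operates,
as this is fundamental to the design of our measurement
system. To download a file in a BitTorrent system, a peer needs
to get meta information first. In standard BitTorrent, the meta
information can be obtained from the torrent file, which also
contains a list of centralized trackers to help a peer get the
initial peer set to bootstrap the download.

Partly due to legal issues, but also based on improving
the service availability and system robustness, distributed
trackers were developed. BitTorrent has two independent
distributed tracker implementations, even though
both are based on the Kademlia DHT. One is Vuze [7]
and the other is Mainline DHT. In MLDHT, both nodes and
content have IDs. Content IDs are also known as infohashes.
A peer uses this infohash to obtain the initial peer set. MLDHT
supports four control messages:

1) PING: probe a node’s availability. If the node fails to
   respond, it is eventually purged from the routing table.
2) FIND_NODE: given a target ID, return the k closest nodes
   to it in the queried node’s routing table.
3) GET_PEERS: given an infohash, return the corresponding
   peer set, or the k closest nodes if the infohash is unknown.
4) ANNOUNCE_PEER: announce that the sender belongs to a
   swarm.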


Figure 1 illustrates normal operation in MLDHT. Suppose
we have 3 nodes A, B and C. A holds a file with infohash
x = 59. Assume B is responsible for storing the peer set of x.
Node C wants to download the file.

Fig. 1: Normal operation of MLDHT (A is Peer 29, B is Peer 57)

First, A publishes the file by storing x at B. A calls 
GET_PEERS iteratively to get closer and closer to B, and 
finally reaches it. Then, A uses ANNOUNCE_PEER to tell B
that it is sharing a file with infohash x. B stores A’s contact
information in the corresponding peer set for x. Since A is 
the publisher, it is the only one in the peer set at the moment. 

When A sends the GET_PEERS messages, two possibilities 
emerge. If the queried node knows this infohash already and 
stores some peers in the corresponding peer set, it will respond 
with the peer set. If it does not know the infohash, it will 
respond with the k closest nodes to the infohash in its routing 
table. In such a way, A will get closer and closer to B, and 
finally reach B and the search finishes. FIND_NODE, which 
we use in our work, behaves the same way, except that x in 
this case represents a node instead of content. 
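
The iterative lookup described above can be sketched as follows. This is a toy model, not the crawler used in the paper: the routing tables are a plain dictionary standing in for real FIND_NODE or GET_PEERS replies, the function names are hypothetical, and K = 2 is chosen only to keep the example small. Distances use the XOR metric of Kademlia.

```python
from typing import Dict, List, Set

K = 2  # number of closest nodes per reply (toy value, an assumption)


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance between two IDs."""
    return a ^ b


def closest(nodes, target: int, k: int = K) -> List[int]:
    """Return the k distinct IDs from `nodes` that are closest to `target`."""
    return sorted(set(nodes), key=lambda n: xor_distance(n, target))[:k]


def iterative_lookup(start: int, target: int,
                     tables: Dict[int, List[int]], k: int = K) -> List[int]:
    """Query ever closer nodes until the shortlist stops changing.

    `tables` maps a node ID to the IDs in its routing table; looking a
    node up in it plays the role of sending that node a query.
    """
    queried: Set[int] = set()
    shortlist = closest(tables[start], target, k)
    while True:
        pending = [n for n in shortlist if n not in queried]
        if not pending:
            return shortlist  # every node in the shortlist has answered
        for node in pending:
            queried.add(node)
            reply = closest(tables.get(node, []), target, k)
            shortlist = closest(shortlist + reply, target, k)
```

With hypothetical routing tables built from the peer IDs of Figure 1, a lookup started at A (ID 29) for x = 59 ends at B (ID 57), the node closest to x.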

For C to download the file, it should get x first. It will pro- 
ceed exactly the same as A did before by using GET_PEERS 
to approach B. Since B already saved the peer set for x, C can 
obtain the initial peer set from B. C joins the swarm, sets up 
connections to the peers in peer set, and gets metadata (torrent- 
file) from other peers using BitTorrent extension protocols 
[22], [23]. Then the download process starts. In our work, 
we do not consider the download process. 


III. METHODOLOGY


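The basic idea of our method is to crawl an n-bit zone, which
means that we collect the nodes whose IDs share a given n-bit
prefix. We then scale up the number of nodes found in the zone
to estimate the size of the whole network. We will first focus
on evaluating different zone sizes and then derive a bound
for the error in our method. The difference between our work
and the previous ones is that we have a well-defined methodology
to validate our experiment design and the results.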


A. Assumptions 
This kind of measurement method is based on the assumption
that node IDs are uniformly distributed in the ID space. In
theory, the MLDHT protocol should guarantee a uniform
distribution of node IDs. However, this cannot be taken for
granted, and the abuse of certain IDs has already been
observed, as Steiner et al. reported in [2].
We carefully examined a large set of samples (over 32000) 


crawled from the different parts of MLDHT, and found the 
node IDs follow a uniform distribution. We did observe some 
abused IDs such as ID 0, but they only contribute a trivial 
amount of nodes to the whole MLDHT, and can be safely 
neglected in our case. 


B. Choosing a Zone 
To bootstrap the crawling process, we maintain a set of active
nodes in a FIFO queue.



Since the node IDs are assumed to be distributed uniformly 
in the ID space, there should be no correlation among session 
length, content and other factors. We have extensively sampled 
many different parts of the ID space and have observed the 
assumption about uniformly distributed IDs to hold. 

Figure 2 shows the number of nodes in different n-bit zones 
in one sample. Recall that nodes in an n-bit zone share an n-bit 
prefix. We can see that the curve exhibits stable behavior between
the 5-bit zone and the 25-bit zone.
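
To make the notion of an n-bit zone concrete, the following sketch tests whether a 160-bit node ID falls into a zone and performs the plain, uncorrected scale-up from a zone count to a network size. The function names are hypothetical, and the scale-up relies on the uniform ID distribution assumed in Section III-A.

```python
ID_BITS = 160  # length of an MLDHT node ID


def zone_prefix(node_id: int, n: int) -> int:
    """Return the n most significant bits of a 160-bit node ID."""
    return node_id >> (ID_BITS - n)


def in_zone(node_id: int, prefix: int, n: int) -> bool:
    """True if node_id belongs to the n-bit zone identified by prefix."""
    return zone_prefix(node_id, n) == prefix


def naive_scale_up(nodes_in_zone: int, n: int) -> int:
    """Scale a zone count to the whole ID space (there are 2^n zones).

    No correction for missed nodes is applied here.
    """
    return nodes_in_zone * (1 << n)
```

For example, about 4000 nodes observed in a 12-bit zone scale up to roughly 16.4 million nodes when no correction is applied.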

Then we have to answer the first question: Which zone
should we choose?

For the first question, although using a large zone (small n)
would seem attractive, this method is problematic. The reason
is that as the zone size increases, the crawler misses more
nodes, which is also revealed by Memon et al. in [3] when they
were crawling KAD. Furthermore, crawling a large zone takes a
long time, leading to hard-to-estimate errors. In this paper, our
aim is to obtain an estimate quickly and accurately with low
overheads, hence a large zone is not appropriate.

Fig. 2: Number of nodes discovered by our crawler in different
n-bit zones (x-axis: n-bit zone; y-axis: # of nodes). There were
about 20 million nodes in the system when the experiment was
performed. With 5-bit zone, the crawler reached its performance
limit and many nodes were missed. Beyond a 24-bit zone, the
node density is so sparse that the crawler cannot find any other
nodes except itself.

However, a very small zone is not appropriate either. As will be
discussed below, it is inevitable that some nodes will be missed
in each crawl. If we choose a very small zone (large n), the slight
fluctuation caused by the missing nodes will be magnified after
scaling up, leading to a huge fluctuation in the estimate. This
fluctuation can be amortized over a larger zone.

By testing different zone sizes, we found a 12-bit zone to be
appropriate, since it achieves a good trade-off between the
overheads and accuracy. We can finish a crawl of such a zone
within 5 seconds. We must point out that a 12-bit zone is not
the only choice; other zone sizes can be used as well.
The only difference between different zone sizes is that they
have different correction factors (see Section III-D).


C. Scaling Up


The second question we need to answer is: Can we safely
scale up the number of nodes found in a zone to obtain the
network size? The answer is in fact no, simply because a crawler
cannot guarantee that no node will be missed. This question is
overlooked by most of the previous work. Most related work
simply scales up the number of nodes, but as we show in this
paper, this leads to an incorrect result. It is because of this
incorrect scaling that we know the previous methodologies to
contain an error.

In [2], Steiner et al. set up two crawlers at different vantage
points and combine the views in an attempt to get a better
estimate. They mentioned that one crawler sometimes saw
fewer nodes than the other, and they blamed this on network
connectivity. In such cases, they just merged two node sets.
However, there is a subtlety hidden here, since they did not
explicitly mention whether one node set is a proper subset of
the other or whether “fewer” simply means “less in number”
but does not take a stand on the overlap between the sets.
Although such a combination of views is legitimate, two views
are not enough to get a correct estimate, which is well illustrated
by Figure 3.

Fig. 3: Number of nodes in merged samples

In our study, we consider the missing node issue as a
systematic measurement error, and model the whole process as
a Bernoulli process. We present how we tackle this problem in
Section III-D. In order to justify the need for this approach, we
performed an experiment in which 20 simultaneous samples
are crawled within the same 12-bit zone. Each sample’s 12-bit
zone contains about 4000 nodes. Figure 3 plots how the
number of distinct nodes increases as we merge the samples
one by one. As more samples are merged, the number of
distinct nodes increases logarithmically. As an example,
consider the method from [2], where two samples are combined:
it would extract about 4000 distinct nodes, when we know from
combining 20 samples that there are at least 5500 distinct nodes
in the zone. Using these numbers and scaling them up to get an
estimate of the whole network size would yield an error of 37.5%.


D. Correction Factor


Nodes can be missed in a crawl for many reasons, such as network
connectivity, congestion, different implementations, peer’s
abnormal behavior, etc. Previous works typically attempt to
prove the accuracy of their measurement by showing the
small variations of samples without considering this issue.
However, small variation cannot guarantee the accuracy if the
measurement contains a systematic error.

Let p denote the probability that a node is found by the crawler
when it is crawling a 12-bit zone. Then the number of nodes
observed in a crawl has to be multiplied by 1/p to obtain an
estimate of the actual number of nodes in the zone. We call this
multiplier Correction Factor (CF).

To estimate p, we devised an experiment. First, we insert a
special ID, say x, into the zone to be crawled. Second, we
let the crawler keep crawling this zone repeatedly to obtain
a series of samples. By counting in how many samples x
appears we can estimate the value of p.

It is obvious that the more simultaneous samples we have,
the more accurate the estimation of p will be. So in practice,
we insert 50 IDs into a specific 12-bit zone, and start multiple
crawlers simultaneously. In such a way, the measurement
efficiency can be improved significantly, because we can get
50 estimations of p in a single experiment. We ignored the
effect of the inserted IDs on the zone, since they are
always well below 1% of the actual IDs in this zone. Then
we generate another 50 random IDs and repeat the experiment
again. In total, we obtained 500 estimations of p.

We ran the Jarque-Bera test¹ on our estimations, and the null
hypothesis was not rejected at significance level α = 0.05.
Figure 4 shows the distribution of these estimations together
with a fitted normal distribution. The average p is 0.8276 and
the sample deviation is 0.0280, so the 95% confidence interval is
(0.7716, 0.8836).

Fig. 4: The distribution of 500 estimations for p

We also investigated the coverage in 13-16 bit zones; for
example, in a 13-bit zone, the average p is 0.8343, and the sample
deviation is 0.0266. The results derived from different zones
with the corresponding correction factor are similar.

To verify our result for p, we ran two parallel experiments.
We collected several samples in a 12-bit zone and combined
the samples. After dropping duplicates, the number of unique
nodes increases as shown in Figure 3. Because the result
converges, we can use this to estimate p. We matched this
estimate with the estimate of p given by the Bernoulli process
described above and found out that they match. This serves as
validation of our estimate of p.

¹ The Jarque-Bera test is a test of normality. The null hypothesis is “the data is
drawn from normal distribution (skewness and excess kurtosis are both zero)”.

We must point out that p is subject to many factors and
changes over time. Thus the value of p needs to be re-estimated
periodically. We will show how we
tackle this problem in designing our monitor system in Section
IV-A.
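
The estimation of p from injected IDs can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: samples are modeled as plain sets of node IDs, the function names are hypothetical, and the correction is taken to be the multiplier 1/p, which is an interpretation consistent with the confidence intervals reported for the correction factor.

```python
from statistics import mean
from typing import Iterable, List, Set


def estimate_p(injected_ids: Iterable[int], samples: List[Set[int]]) -> List[float]:
    """One estimate of p per injected ID: the fraction of samples containing it."""
    return [sum(1 for s in samples if x in s) / len(samples) for x in injected_ids]


def corrected_zone_size(observed: int, p: float) -> float:
    """Apply the correction factor 1/p (assumed form) to a zone count."""
    return observed / p


def network_size(observed: int, n: int, p: float) -> float:
    """Corrected zone count scaled to the 2^n zones of the ID space."""
    return corrected_zone_size(observed, p) * 2 ** n


if __name__ == "__main__":
    # Hypothetical crawl results: IDs 1, 2 and 3 were injected by us.
    samples = [{1, 2, 3, 10}, {1, 3, 11}, {1, 2}, {1, 12}]
    estimates = estimate_p([1, 2, 3], samples)
    print(estimates, mean(estimates))
    print(network_size(4000, 12, 0.8))
```

Averaging the per-ID estimates gives a single value of p for the zone; the injected IDs themselves should be excluded from the observed count before scaling.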


E. Validation of Methodology 


We validated each step of our methodology from the as- 
sumptions to the final outcome. Our repeated samplings con- 
firm that the assumption about IDs being uniformly distributed 
is justified; results in Section III-A demonstrate this. 

In order to model the sampling as a Bernoulli process, 
we assumed that the crawler will always miss some nodes 
in a crawl. This assumption is verified by the results in 
Sections III-B and III-D and Figure 4. The same results also
verify the accuracy of our estimate for p. Furthermore, we 
performed a controlled emulation experiment described below 
on our cluster to obtain a ground truth against which we can 
compare our results. 

To this end, we designed and performed an emulation using
our cluster. The cluster consists of 240 Dell
PowerEdge M610 nodes and each node is equipped with 2
quad-core CPUs, 32GB memory, and connected to a 10-Gbit
network. All the nodes run Ubuntu SMP with a 2.6.32 kernel.

To make the experiment setting more in line with the real 
world, we used three different MLDHT client implementations
in our experiments: Mainline BT, Aria and libtorrent. Each
of these is a popular client used currently on MLDHT, and
because they use the same DHT, they are compatible at the
protocol level. We also tested a heterogeneous situation by
mixing different fractions of each client together. In each
experiment, we deployed one million MLDHT nodes on 200 
machines, i.e., 5000 instances per machine. Figure 5 shows 
the value of p obtained by our crawler for four different traffic 
mixes using either 10-bit or 12-bit zones. 

For each p value, experiments were repeated 50 times and
the arithmetic mean is used. Table II shows a more detailed view
of the parameters, including p, its standard deviation, and the
95% confidence interval for the correction factor. As we show
later, running a large number of parallel samples allows the
confidence intervals to be made arbitrarily small.

As we can see from the results, there is no significant 
difference in p value in different mixtures, even though a 
network with only Mainline BT clients always has a slightly 
higher value. Given the consistency of the results, we can
consider that the client mix does not significantly affect p.
A 10-bit zone’s p value is
always smaller than a 12-bit zone’s, but the corresponding stdev
is also smaller. This result is also consistent with our previous
discussion that a larger zone provides a more stable estimation 
but suffers from lower coverage. 
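
The confidence intervals quoted in this section can be checked with a few lines of arithmetic. The sketch below assumes that the intervals are the mean plus or minus two standard deviations and that the correction factor is 1/p; both are inferences from the reported numbers rather than formulas stated in the text, and the function names are hypothetical.

```python
from typing import Tuple


def p_interval(p_mean: float, p_std: float, z: float = 2.0) -> Tuple[float, float]:
    """Interval for p as mean +/- z standard deviations (z = 2 is an assumption)."""
    return (p_mean - z * p_std, p_mean + z * p_std)


def cf_interval(p_mean: float, p_std: float, z: float = 2.0) -> Tuple[float, float]:
    """Interval for the correction factor 1/p; the bounds swap under inversion."""
    low, high = p_interval(p_mean, p_std, z)
    return (1.0 / high, 1.0 / low)


if __name__ == "__main__":
    # 12-bit zone in the real network: average p 0.8276, deviation 0.0280.
    print(p_interval(0.8276, 0.0280))
    # 10-bit zone, Mainline BT only, in the emulation: p 0.8215, stdev 0.0173.
    print(cf_interval(0.8215, 0.0173))
```

Under these assumptions the first call gives the interval (0.7716, 0.8836) for p, and the lower bound of the second interval rounds to 1.1681.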


Fig. 5: 10-bit and 12-bit zone’s p value as a function of
different mix percents of three applications (x-axis: mix percent
MLBT:ARIA:LIBT, with mixes 1:0:0, 8:1:1, 6:2:2 and 4:3:3).
stdev is small and therefore omitted from the figure. 10-bit
zone’s stdev is consistently smaller than 12-bit zone’s.
(MLBT: Mainline BitTorrent; ARIA: Aria; LIBT: libtorrent)


We also tested the effect of firewalled nodes in our controlled
environment. (As discussed in more
detail in Section IV-D and in [12], firewalled nodes can
represent a significant fraction of the nodes in the network.)
Firewalled nodes can still enter and stay in other nodes’
routing tables, because various
BitTorrent (protocol) level operations can trigger the nodes
being inserted into the routing table. We emulated such a
situation by configuring a fraction (< 30%) of the nodes as
firewalled nodes. The results were consistent with those above,
giving
us confidence that our methodology does not suffer from the
presence of a large fraction of firewalled nodes.


F. Implications 


The above experiment has several implications regarding 
our measurement methodology and large-scale system
measurements in general. First, the results of the controlled
experiment show that our methodology, in particular the
correction factor, is indeed correct and a vital component of a
measurement framework. Our crawler is shown to work efficiently
in realistic scenarios and to provide accurate estimates of the 
network size. Second, in practice a query does not always return
the actual k closest neighbors, which results in the need to
have the correction factor. Third, the value of p depends on
the zone size. We
observed differences of 2-4% in value of p between 10-bit and
App Percent (%): MLBT | ARIA | LIBT || 10-bit: p | stdev | Corr. Factor || 12-bit: p | stdev | Corr. Factor
0 


100 0 0.8215 | 0.0173 


80 [1010] 0.8045] 001 


(1.1681, o 0.8437 


7 T. Ha T264) 
0.029 04 


Od © 0.8 


REN 
o a a o GOLL) | 03501 | 0.0943 


0.8109 0.01950 


(1.1780, 1.2938) 


(1.1254.1.2726) 


TABLE II: 10-bit and 12-bit zone’s p value with 95% confidence interval as a function of different mix percents of three 
applications. (MLBT: Mainline BitTorrent; ARIA: Aria; LIBT: libtorrent) 


Simultaneous samples rr H e H H 
I6 25402 


Without CF 
A A EE 77 ? 


40.68% 24.62% 12.61% 481% OIG 7.82% 


TABLE III: Estimation with and without correction factor. The numbers report the estimated size of the network when running 


n simultaneous samples. (CF: Correction Factor) 


12-bit zones in our controlled test, whereas we have observed
differences of 6-9% in the real MLDHT. As the zone gets
larger, p decreases, which affects the accuracy of the estimate.
Scaling
the node density without considering the correction factor is
guaranteed to underestimate the network size, but there is
no way to estimate how much lower it is. Our methodology
eliminates this problem.


The correction
factor strikes a trade-off, by allowing a much lower use of
measurement resources and still obtaining the same level of
accuracy as a large number of simultaneous samples would
provide. Table III shows the network size estimated by using
a different number of simultaneous parallel samplers with and
without the correction factor. Naturally, as shown in Table III,
multiple parallel samplers improve the accuracy of our method
as well, although the error between 1 and 20 parallel samplers
is only on the order of 5%. However, the correction factor does
not explain the exact causes of the missing node issue, which can
be manifold. Finding those causes will be our future work.


IV. EXPERIMENTS 


We now present the implementation of our crawler and 
discuss practical aspects related to data collection. 


A. System Architecture 


Figure 6 shows the four principal components of our system.
An efficient Crawler lies at the core of the whole system.
It can finish a crawl within 5 seconds, trying to gather
as many nodes as possible in a target zone. Beneath the
crawler, the Maintainer component maintains a set of over
3000 active nodes, and randomly provides 100 nodes among
these long-lived nodes to bootstrap the crawler. The Injector
component is responsible for injecting controlled nodes into
the monitored target zone. Then the sample will be sent to
the Analyzer component, where the p value will be calculated
by checking the occurrence of controlled nodes. Finally, the
estimate and other relevant information will be stored in the
database and visualized by the Visualizer component.

Fig. 6: Principal components of monitoring system (MLDHT
Monitor System: Crawler, Analyzer, Visualizer)

As mentioned before, for each crawl, obtaining several 
hundreds of simultaneous samples to calculate p is expensive. 
So i 


b The formula we 
s ; 


B. Deployment 


We use two nodes for sampling, both with dual quad-core Intel
Xeon E5440 CPUs @ 2.83GHz, 32 GB memory and a Gigabit
connection to the Internet. The operating system is Debian
SMP with a Linux 2.6 kernel. On each node, we set up a crawler
with its own policy.

There are two reasons for setting up a pair of crawlers.
The first reason is to prevent the sample gaps due to
application failure. The second is to assess the accuracy of
the captured data by cross-correlating the samples from two
parallel measurements.

13-th IEEE International Conference on Peer-to-Peer Computing


We have had only a few small
gaps in the collection process until now. The duration for
capturing a snapshot varies within 5 seconds, depending on the
network size at that time. On average, each sample contains
about 20,000 distinct IDs from different zones. The data we
have collected with our crawler are available
for other researchers. Please see
http://www.cs.helsinki.fi/u/jakangas/MLDHT/ for more details.


C. Duplicated IDs 


Basically, we only need to handle two types of duplicates.
The first is the case of the same IP but different ports. We
count such multiple records as one node since it is the
result of a client listening on multiple ports. The other one is
the case of same ID but different IP. There are two possibilities
for this kind of duplicates. The first is pure collision, which
is very rare, and the second is due to modified clients. In both
cases, we count them as one node by selecting one of the
IPs randomly. We consider this acceptable, since such cases
contribute a negligible part in a sample.
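
The duplicate handling described above can be sketched as follows. The record layout (node ID, IP, port) and the function name are assumptions made for illustration; in particular, the first duplicate type is interpreted here as records that share both node ID and IP and differ only in the port.

```python
import random
from typing import Dict, Iterable, List, Optional, Tuple

Record = Tuple[int, str, int]  # (node ID, IP address, port); assumed layout


def count_nodes(records: Iterable[Record],
                rng: Optional[random.Random] = None) -> Dict[int, str]:
    """Collapse duplicate records so that each node ID maps to one IP."""
    rng = rng or random.Random()
    ips_by_id: Dict[int, List[str]] = {}
    for node_id, ip, _port in records:
        ips = ips_by_id.setdefault(node_id, [])
        if ip not in ips:  # same ID and IP on another port: same node
            ips.append(ip)
    # Same ID seen with several IPs: keep one IP chosen at random.
    return {node_id: rng.choice(ips) for node_id, ips in ips_by_id.items()}
```

The number of nodes in a sample is then the size of the returned dictionary.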


D. Non-responding Nodes


Non-responding nodes refer to the nodes who fail to answer
our queries. One possibility is a firewall, another possibility is
that the node has left the network and the routing information
is stale. Unfortunately, we cannot distinguish between these
two cases. On average, non-responding nodes constitute about
30% of our nodes, and this percentage remains quite stable in
all the samples. Other studies, e.g., [12], have found similar
numbers for firewalled nodes. Stale routing entries should
be purged in about 15 minutes on average, but this depends
on the actual implementation of the client. We have seen that
[...]. However, a more thorough study would be needed.

As presented in Section III-F, we tested the performance of
our crawler in a controlled environment with around one third
of the nodes being behind a firewall. As our results showed,
the crawler remained accurate in this setting, and we do
not need to consider any special procedures for them.


E. Crawler Performance Issues


As [17], [24] pointed out before, a crawler must be well
designed and carefully tuned. This is not trivial but is critical
for measurement accuracy. To save time and reduce development
complexity, some previous work, e.g., [4], developed
their crawler from a third-party library and used it in the
experiments. However, those libraries are usually intended for
general use as parts of normal clients and are not specialized
for measurements.

To test this, we implemented a crawler on top of libtorrent, an open
source library for the BitTorrent protocol. Several popular
BitTorrent clients (e.g., Deluge, LimeWire, rTorrent) are developed
with this library, and it is also used in some research projects.
In the first version, we avoided unnecessary tweaks. [...]
Furthermore, the
difference in the estimate of the network size between [4] and
our work is about a factor of 6, leading us to conjecture that the
difference in the results is largely due to the implementation
of the crawler. In the second and third versions, we tuned
some parameters and also tweaked the code in libtorrent to
improve the crawling efficiency. Even though the tuned version
can crawl faster and discover more nodes within a zone, the
reported value is still only 1/4 to 1/3 of the actual value, which
is far from accurate.

We therefore looked into the libtorrent code and carefully
examined its design. The reason for the inefficient performance is the
[...] deteriorate rapidly. For a crawler, the most used method is
obtaining a node set from a specific zone, so the speed of the
crawler is the most important factor considering the
convergence speed. [...]

We then further modified our libtorrent test case and added a
correction module into the libtorrent crawler. This correction
module consisted of components
from our own crawler. [...] After we applied our
correction factor (Section III-D), the results were consistent
with our own MLDHT crawler. This test confirms that a
correct and efficient implementation of a crawler is vital to
getting accurate measurements.

F. MLDHT Evolution


Figure 7 shows the number of nodes in MLDHT during
one week in March 2011, March 2012, and April 2013. The
number of users shown, about 27 million at peak, is typical.
The churn is mainly generated by European, in particular East
European, users; details are not shown due to space constraints.
The weeks shown are typical for the years they depict and
the roughly 10% growth from 2011 to 2012 happened mostly
gradually, although there was a marked increase in Fall of
2011. From 2012 to 2013, the size of the network has remained
roughly stable.

Fig. 7: One week in 2011, 2012, and 2013. The weeks show the typical daily churn, mainly caused by East European users. The
typical number of users increased by about 10% from 2011 to 2012, but has remained stable since then.

(b) During a Sybil-Attack


Fig. 8: Evolution of three system metrics in a day, from top:
correction factor, RTT, and node density. Values have been
normalized to the average value over that day.


V. CORRECTION FACTOR AND ANOMALY DETECTION 


The correction factor can also be used as
an indicator of anomalies in the system. Figure 8a
shows the evolution of three system metrics: correction factor,
average RTT (Round-Trip Time), and node density over 24
hours. For ease of comparison, the values have been
normalized to the average value of that day. From the figure,
we can easily see that the correction factor is the only metric
which remains stable throughout the whole day.
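
The normalization used for this comparison can be sketched as follows. This is a minimal illustration under our own assumptions: each metric is given as a list of samples from one day, and the function names are hypothetical. The spread of the normalized series is one simple way to quantify how stable a metric is over the day; it is not necessarily the measure used to produce the figure.

```python
from statistics import mean, pstdev


def normalize_to_daily_mean(values: list[float]) -> list[float]:
    """Divide every sample by the average value of that day."""
    avg = mean(values)
    return [v / avg for v in values]


def relative_spread(values: list[float]) -> float:
    """Standard deviation of the normalized series.

    A value close to 0 means the metric stays near its daily average.
    """
    return pstdev(normalize_to_daily_mean(values))
```

Because every series is divided by its own daily average, metrics with different units, such as RTT and node density, become directly comparable on one axis.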


[...] the correction factor increases. This is expected, since
[...]. However, a Sybil-attack changes the correction factor
significantly. In our previous
work [27], we reported a real-world Sybil-attack in MLDHT
on Jan. 6, 2011. The attack was launched from two virtual machines
in Amazon EC2 and started at 6:00 am. Figure 8b presents the
evolution of the three metrics captured by our monitoring system
during the attack. From Figure 8b, we can see that the correction
factor increased during the attack
and it recovered to the normal level after the attack ceased.
At the same time, there are no noticeable changes in the other
two metrics. For detailed analysis of these attacks in MLDHT,
please refer to [27].


The reason for this increase is that in a Sybil-attack like
this, the attacker inserts a large number of Sybil nodes into the
network (or part of it). Our sampling process is affected by
this and the fraction of nodes that the crawler discovers
decreases (see Section I-D). Correspondingly, the correction
factor during the attack increases; the
increase is necessary to obtain the correct estimate of the size
of the network.
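
A simple way to turn this observation into an automatic check is sketched below. This is only an illustration under stated assumptions, not the detection procedure of our monitoring system: the median of the series is taken as the normal level, and the tolerance of 20% above it is a hypothetical threshold that would need to be tuned on real data.

```python
from statistics import median


def flag_anomalies(correction_factors: list[float],
                   tolerance: float = 0.2) -> list[int]:
    """Return the indices of samples whose correction factor is unusually high.

    Assumptions: the median of the series represents the normal level,
    and a sample is suspicious when it exceeds that level by more than
    the given tolerance (a fraction, e.g. 0.2 for 20%).
    """
    baseline = median(correction_factors)
    limit = baseline * (1.0 + tolerance)
    return [i for i, cf in enumerate(correction_factors) if cf > limit]
```

The median is used instead of the mean so that a short burst of high values, such as an attack lasting a few hours, does not shift the baseline much.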


VI. RELATED WORK 


There has been a lot of measurement work on different P2P
networks, such as [1]-[3], [9], [10], [17], [19]-[21], [24], [27],
but most of them studied KAD or Vuze DHT, and only [4],
[5], [27] are related to MLDHT. In [27], Wang et al. studied
two major Sybil-attacks in MLDHT and reported large-scale
anomalies in the real-world system using their honeypot design.

In [25], Kostoulas et al. gave a thorough and general survey 
on various techniques for group size estimation of large-scale 
distributed systems. In [3], Memon et al. monitored 32000 
peers in KAD using a single PC. They intercepted most of 
the traffic to the monitored nodes. Their crawler Montra is 
introduced in [24] where they also extensively discuss practical 
issues related to crawling P2P networks. In [17], Stutzbach et 
al. also point out several pitfalls that can cause significant bias 
in the sample. 

In [1], [2], [19], Steiner et al. used the crawler Blizzard to study
KAD. Their work showed China is the biggest country in
KAD, and they also showed the popularity of KAD in Europe.
Our findings mostly concur with these results. In [10], Steiner
et al. crawled Azureus DHT by exploiting REQUEST_STATS
and REPLY_STATS messages in the Azureus DHT protocol.
They found that there are 3 to 4.5 million users in KAD,
followed by Azureus with about 1 million users. According to
our research, there are over 27 million users in MLDHT (at
peak time). We adapted our sampling method from their work
to be able to handle the much larger MLDHT network. Their
original sampling method suffers from the missing node issue
(Section I-D).

[11], [21] focused on content and publishing activities
on popular P2P networks. Zhang et al. carried out a thorough
study on the BitTorrent ecosystem in [6]. Their method
monitors a large number of swarms from popular BitTorrent
portals. As mentioned in [6], this can cause bias in the
measurement since some countries, e.g., China and India, are
underrepresented. In [26], Iosup et al. tried to produce an
accurate geo-snapshot of the BitTorrent network by using trace
files from Supernova.org. Their work, besides being rather old,
suffers from the same limitations as [6] since it ignores all
users not using those particular sites.

In [4], Junemann et al. did work very similar to ours.
However, there is a big difference in the estimate of the network
size. Their estimate is about 5 to 7 million, whereas ours
can be over 27 million. We suspect the reason is that they
adopted a third-party plugin, libtorrent, in their implementation
for crawling the DHT. As discussed in Section IV-E, such
libraries are intended for general use in normal clients and
are not specialized for measurements. As a result, the node
density is severely underestimated. Our method remedies this problem
and in addition also allows us to tell the exact estimation
error. A similar problem also exists in [5] where they estimate
MLDHT size based on a modified plugin from Vuze. The
key reason behind the difference in the results is that both of
these works ignore the missing node issue. As we showed in
Section IV-E, crawler performance is one of the key factors
in getting the right estimate and plain libtorrent is not sufficient
for actual measurement work.


VII. CONCLUSION 


In this paper, we have developed a fast and accurate method 
for estimating the number of nodes in the BitTorrent Mainline 
DHT network. We have identified the missing node problem 
as a key omission in previous work and show how to fix this 
via modeling the crawling as a Bernoulli process. Our method 
provides much more accurate results and is able to run in about 
5 seconds. Our correction factor can also be used to identify 
Sybil-attacks in the system. 

We have validated our methodology by taking previously
developed measurement methodologies and showing in a
controlled environment that they lead to an incorrect estimate of
the number of nodes. We also show that practical crawler
implementation issues can easily lead to large errors.

Concerning the actual number of nodes in the system, our
results show that the number varies between 15 and 27 million
per day with a very clear daily churn pattern. European users
dominate both in terms of number of users and churn. Over the
course of the past 30 months of our study, we have seen that the
number of users increased by about 10% from 2011 to 2012,
but has remained stable since then.